-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds index usage info to get_index_information(include_stats=True)
#5320
Conversation
WalkthroughThe pull request introduces enhancements to index information retrieval in the FiftyOne library. The changes focus on expanding the Changes
Possibly related PRs
Suggested reviewers
Poem
Tip CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/unittests/dataset_tests.py (2)
Line range hint
680-681
: Consider verifying unique indexing constraints.You create two indexes on the same field (
gt.detections.label
andframes.gt.detections.label
). If the code or future changes require uniqueness or partial indexes, explicitly test those aspects so that incorrect index usage does not slip by undetected.
Line range hint
697-700
: Validate structure of index statistics.By checking
'size'
,'ops'
, and'since'
keys, you confirm critical fields exist. Consider additional type or range checks (e.g., verifying'ops'
is an integer) to tighten correctness if needed.fiftyone/core/collections.py (1)
Line range hint
9509-9566
: Architecture advice for index statisticsThe implementation provides valuable index usage information but could benefit from some additional features:
- Consider caching the index stats for a configurable duration to avoid frequent aggregation queries
- Add helper methods to analyze index usage patterns and provide recommendations
- Consider adding index size thresholds/alerts for large indexes
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
fiftyone/core/collections.py
(3 hunks)fiftyone/core/dataset.py
(2 hunks)tests/unittests/dataset_tests.py
(2 hunks)
🔇 Additional comments (9)
tests/unittests/dataset_tests.py (3)
674-674
: Good test naming.
The descriptive method name clearly conveys the purpose of testing index statistics.
675-677
: Efficient sample setup.
Reusing the same gt
object across both the sample and its frames is a straightforward approach, but ensure this does not unintentionally mutate gt
if future tests modify the label object in-place.
Line range hint 695-696
: Comprehensive index presence checks.
These lines effectively verify that all expected indexes have indeed been created. This is a robust assertion step that ensures the test can catch regressions where indexes are removed or renamed.
fiftyone/core/dataset.py (2)
1205-1205
: Ensure _sample_collection
is never None
If _sample_collection
is unexpectedly None or invalid, this line would raise an error. Consider verifying that the attribute is properly initialized before returning its stats.
1211-1211
: Initialize frame collection before usage
If _frame_collection
is None for non-video datasets, this call may raise an exception. Consider returning early or adding a check before calling _get_collstats
.
fiftyone/core/collections.py (4)
9509-9510
: Documentation improvement for include_stats
parameter
The docstring update accurately reflects that the method now returns size, usage and build status information for each index when include_stats=True
.
9531-9537
: Added index usage statistics via $indexStats
aggregation
The code now retrieves index usage statistics by using MongoDB's $indexStats
aggregation operator, which provides information about how indexes are being used. This is a good addition that helps users understand index utilization patterns.
9560-9566
: Consistent implementation of index stats for frame collections
The same index usage statistics gathering logic is properly applied to frame collections, maintaining consistency with the sample collection implementation.
Line range hint 9509-9566
: Verify index stats aggregation performance
The added $indexStats
aggregation could potentially impact performance on large collections. Let's verify this:
✅ Verification successful
$indexStats
aggregation is safe to use
Based on the codebase analysis, the $indexStats
aggregation is safely implemented and well-tested:
- The feature is used only to gather metadata about existing indexes (size, usage, build status) and not for querying data
- It's part of the core MongoDB functionality for index management
- The implementation is covered by unit tests in
tests/unittests/dataset_tests.py
withtest_index_stats
- The aggregation is only executed when explicitly requested via
include_stats=True
parameter - The operation is lightweight as it only returns index metadata from MongoDB's system collections
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if indexStats aggregation is performant on large collections
# Look for large collections (>100k documents)
mongo_expr='{"$expr": {"$gt": [{"$size": "$samples"}, 100000]}}'
rg --json "$mongo_expr"
# Look for collections with many indexes
ast-grep --pattern 'create_index($$$)'
Length of output: 300
Script:
#!/bin/bash
# Search for index creation patterns and performance considerations
rg -A 3 "create_index\("
# Look for any performance-related comments or documentation about indexStats
rg -i "indexstats|performance|large collection"
# Check if there are any existing performance tests related to indexes
fd -e py test | xargs rg -l "index.*test|test.*index"
Length of output: 71279
@@ -9528,6 +9528,13 @@ def get_index_information(self, include_stats=False): | |||
if key in sample_info: | |||
sample_info[key]["in_progress"] = True | |||
|
|||
for d in self._dataset._sample_collection.aggregate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if it would be useful to pull this out and make this a function that accepts the collection name as an arg to allow for creating something like an admin dashboard to manage indexes across datasets without needing to load each dataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case makes sense 👍 I'd suggest we consider whether a potential refactor adds clarity at that time
Change log
get_index_information(include_stats=True)
collstats()
calls with the$collStats
aggregation, as the former has been deprecated since MongoDB v6.2Dataset._frame_indexes
propertyExample usage
collstats
->$collStats
migrationSummary by CodeRabbit
New Features
Bug Fixes
Refactor